Validity of outcome measures used in randomized clinical trials and observational studies in degenerative lumbar spinal stenosis

It is unclear whether outcome measures used in degenerative lumbar spinal stenosis (DLSS) have been validated for this condition. Cross-sectional analysis of studies for DLSS included in systematic reviews (SA) and meta-analyses (MA) indexed in the Cochrane Library. We extracted all outcome measures for pain and disability. We assessed whether the studies provided external references for the validity of the outcome measures and the quality of the validation studies. Out of 20 SA/MA, 95 primary studies used 242 outcome measures for pain and/or disability. Most commonly used were the Visual Analogue Scale (VAS, n = 69), the Oswestry Disability Index (n = 53) and the Zurich Claudication Questionnaire (n = 22). Although validation references were provided in 45 (47.3%) primary studies, only 14 validation studies for 9 measures (disability n = 7, pain and disability combined n = 2) were conducted specifically in a DLSS population. The quality of the validation studies was mainly poor. The Zurich Claudication Questionnaire was the only disease-specific tool with adequate validation for assessing treatment response in DLSS. To compare results from clinical studies, outcome measures need to be validated in a disease-specific population. The quality of validation studies needs to be improved, and the validity of outcome measures needs to be adequately cited in studies.

Degenerative lumbar spinal stenosis (DLSS) is defined by diminished space for the neural and vascular elements in the central canal of the lumbar spine secondary to degenerative changes of the facet joints, ligaments, vertebrae, and intervertebral discs 1,2 . DLSS is a common disease in elderly patients and typically presents with neurogenic claudication symptoms including pain in the buttocks and lower extremities provoked by walking or extended standing and relieved by rest and bending forward 3 . The treatment options range from nonsurgical approaches such as analgesics, physiotherapy, and epidural corticosteroid injections to surgical methods.
In the past, a multitude of studies assessed the effects of these treatment options for DLSS. To establish firm and stringent evidence-based clinical guidelines on the cost-effective use of treatment interventions, results from clinical trials need to be compared. This is particularly important in systematic reviews and meta-analyses, where conclusions are based on the available studies 4 . However, many trials use different outcome measures, which complicates the comparison of trial results. Further, studies may use measures that were not validated in the DLSS population and may therefore fail to identify clinically relevant changes or differences in this patient population. Indeed, one study showed that depending on the outcome measure that was used and the cut-off values for clinically important improvement, the conclusion of a study may be strongly

Quality of validation study. Two reviewers (DR and MW) analyzed the methodological quality of the validation process using the COnsensus-based Standards for the selection of health status Measurement INstruments (COSMIN, https://www.cosmin.nl/tools/checklists-assessing-methodological-study-qualities, accessed on December 2, 2022) checklist 8 . The COSMIN checklist was developed to assess the methodological quality of studies on measurement properties of health-related patient-reported outcomes. We extracted information on eight domains: content validity, internal consistency, construct validity, criterion validity, reliability, responsiveness, floor/ceiling effects, and interpretability.
Interpretability: Was the degree to which one can assign qualitative meaning to quantitative scores assessed (anchor-based method recommended to determine the minimal clinical difference; sample size ≥ 50 patients)?
Two reviewers (DR and MW) independently assessed each domain and rated it as fulfilled (+, defined as very good or adequately addressed), not fulfilled (-, doubtful or inadequate), not applicable (NA), or not reported (NR). Disagreements between the reviewers were discussed and resolved by consensus. If no consensus could be reached, the study was discussed with a third reviewer (FB). All disagreements were resolved.

Characteristics of the included primary studies. The characteristics of the included primary studies are summarized in Table 2. Most of the studies were randomized controlled trials (n = 50, 48.5%) and prospective cohort studies (n = 34, 35.8%). Almost three quarters (73%) of the primary studies involved at least one surgical intervention. Studies were published between 1983 and 2016.
The primary studies included a total of 7,878 participants with a median age of 63.5 ± 7.1 years (range 44-76.2 years). The median follow-up duration was 78.1 ± 81.3 weeks (range 1-480 weeks). Table 3 summarizes the outcome measures used in the primary studies. In total, 242 outcome measures were identified. In the domain of pain, four outcome measures were identified. The Visual Analogue Scale (VAS, n = 69, 90%) and the Numeric Rating Scale (NRS, n = 9, 9%) were most commonly used. In the domain of disability, a total of 12 outcome measures were identified. The Oswestry Disability Index (ODI, n = 53, 47%) and various tests assessing walking tolerance (n = 34, 29%) were most commonly used (walking ability 9-11 , pain free walking 12 , walking distance, walking test 38 , walking time 39 , walking < 15 minutes 40 , walking tolerance 41 ).
Outcome measures and reference studies. In total, 45 primary studies (47.3%) provided a reference for at least one outcome measure. In the domain of pain, references were provided for the VAS (n = 5) and the NRS (n = 2). In the domain of disability, the ODI (n = 22) and the Roland Morris Disability Questionnaire (RMQ, n = 8) were most frequently referenced. In the domain of pain and disability combined, the ZCQ (n = 14) was commonly referenced. For nine outcome measures (disability n = 7, pain and disability combined n = 2), a total of 14 validation studies specifically for a DLSS population were found. For the ZCQ (n = 4) 42-45 and the ODI (n = 3) 43,46,47 , more than one validation study was identified. For details see Table 4.

Twelve of the included 14 studies reached a quality score of 3/8 or less, indicating low methodological quality. None of the validation studies reached the maximum score (range 2/8-7/8). The two studies by Stucki et al. 44,45 , which assessed the validity of the ZCQ in a DLSS population, achieved the highest scores (6/8 and 7/8, respectively). The Beaujon scoring system (BSS) and various tests assessing walking tolerance were tested in a DLSS population. However, the methodology of these validation studies was not in agreement with the methodological items proposed for measurements of health-related patient-reported outcomes 8 .
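The quality scores reported here (for example 3/8) are expressed out of the eight assessed domains. The following is a minimal sketch of how such a score could be tallied, assuming that the score is simply the number of domains rated as fulfilled (+); the text does not spell out the scoring rule, so this is an assumption. The function names and the example ratings are hypothetical and are not data from the study.

```python
from typing import Dict

# The eight domains named in the quality assessment.
DOMAINS = (
    "content validity",
    "internal consistency",
    "construct validity",
    "criterion validity",
    "reliability",
    "responsiveness",
    "floor/ceiling effect",
    "interpretability",
)

# Rating symbols used by the reviewers: fulfilled, not fulfilled,
# not applicable, not reported.
RATINGS = {"+", "-", "NA", "NR"}


def quality_score(ratings: Dict[str, str]) -> int:
    """Count the domains rated '+' (assumed scoring rule)."""
    missing = set(DOMAINS) - set(ratings)
    if missing:
        raise ValueError(f"missing domains: {sorted(missing)}")
    invalid = {d: r for d, r in ratings.items() if r not in RATINGS}
    if invalid:
        raise ValueError(f"invalid ratings: {invalid}")
    return sum(1 for domain in DOMAINS if ratings[domain] == "+")


def format_score(score: int) -> str:
    """Render a score in the 'x/8' notation used in the text."""
    return f"{score}/{len(DOMAINS)}"


# Hypothetical ratings for one validation study (illustration only).
example = {domain: "NR" for domain in DOMAINS}
example.update({
    "internal consistency": "+",
    "construct validity": "+",
    "reliability": "+",
    "responsiveness": "-",
})

print(format_score(quality_score(example)))
```

Under this assumed rule, domains rated NA or NR count the same as domains rated as not fulfilled; a scheme that excluded non-applicable domains from the denominator would produce different scores.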

Discussion
Main findings. The results of this cross-sectional analysis indicate that the reporting of outcome measures in randomized clinical trials and observational studies in DLSS is insufficient. Fewer than half of the included primary studies provided a reference for at least one outcome measure in the domain of pain, disability, or combined pain and disability. A total of 14 validation studies for nine outcome measures were found. The quality assessment revealed low quality for the majority of the validation studies. Within the DLSS population, three validation studies were found for the ODI and four for the ZCQ. However, all three validation studies for the ODI scored unsatisfactorily in the quality assessment. Based on this study, the ZCQ represents the only disease-specific tool with adequate validation for assessing treatment response in DLSS.

Meaningful use of a measurement instrument depends not only on the validity of the instrument itself, but also on the context in which it is used 50 . Web-based systems such as PROMIS have been developed from efforts to optimize and simplify the process of selecting an appropriate measurement instrument 51 . The stated goal is to provide well-constructed, generalizable, and clinically relevant endpoints for studies 52 . These systems facilitate the completion of questionnaires for subjects, as otherwise there would be a considerable administrative burden. In 2006, the North American Spine Society (NASS) Compendium for the Assessment and Research of Spinal Disorders recommended the Quebec Back Pain Disability Scale, the Roland Morris Disability Questionnaire, and the Waddell Nonorganic Signs for lumbar pain as measurement tools 53 . In contrast to lumbar back pain, there are currently no specific recommendations for the use of measurement tools in DLSS 54 . However, measurement tools that are valid for patients with nonspecific back pain do not necessarily measure the relevant endpoints for patients with DLSS. 
The latter have a different clinical presentation with typical claudication symptoms. Consequently, depending on the conception and design of a questionnaire, clinical outcomes may vary significantly 5 . Measured symptoms can vary widely, as shown in a recently published study 55 . The comparison of measurement instruments in patients with DLSS showed a variability of 40-70% depending on the cut-off and the measurement instrument. The extent of this variation was relevant enough to lead to completely different interpretations of a study. In a recently published study 56 , the ZCQ was the most responsive tool to assess symptoms and function in DLSS, supporting the findings of the current systematic analysis. The use of non-validated, nonspecific measurement instruments in studies has an impact on future clinical decisions. Kimberlin et al. 57 argue that although any outcome of a measurement instrument is only an approximation of the actual truth, the use of non-validated measurement instruments has the same effect on study quality as a poor study design or an insufficient number of patients. Our study shows that many of the measurement tools used have not been validated in DLSS patients and it is therefore unclear whether they represent what is relevant to patients. The issue of including a multitude of different outcomes in trials of the same intervention is not novel. For example, in their systematic review from 2017, Mayo-Wilson et al. 58 identified variation in outcomes across reports of RCTs on the effect of gabapentin for treating neuropathic pain and of quetiapine for bipolar depression. 
The authors found that the RCTs included hundreds of outcomes and concluded that researchers may cherry-pick what they report from multiple sources of RCT information. This creates challenges for interpreting clinical trials and obstacles to comparing clinical trials in meta-analyses.
The development of a measurement instrument involves testing validity and reliability with a defined target population 49 . Choosing a measurement instrument wisely can be challenging given the growing number of choices available. In recent years, various efforts have been made to systematically assess the validity of measurement instruments 59 . Meaningful use of a measurement instrument depends not only on the validity of the instrument itself, but also on the context in which it is used 50 . Web-based systems such as PROMIS have been developed from efforts to optimize and simplify the process of selecting an appropriate measurement instrument 60 . The stated goal is to provide well-constructed, generalizable, and clinically relevant endpoints for studies.

Strengths and limitations.
To the best of our knowledge, this is the first cross-sectional analysis of outcome measures used in randomized clinical trials and observational studies in DLSS. In addition, we conducted a validity check of the outcomes applying existing guidelines for conducting systematic literature reviews 51 .
As we focused on systematic reviews and meta-analyses, it is possible that individual studies were not identified in our analysis. However, we are confident that our methodology included the most relevant papers. The main limitation of this study is that this approach did not capture all validation studies conducted to date. An overview of all validation studies ever conducted in patients with DLSS would require a systematic review. By using complete sets of studies included in systematic reviews and meta-analyses, we assessed the quality of reporting of validation studies and the quality of the validation studies themselves. We therefore did not aim to provide a complete overview of all validation studies conducted in DLSS. Thus, a study included in this systematic literature review had undergone two selection processes.
Implications for clinical research. To assess the effectiveness of treatments in studies of patients with DLSS, valid and comparable measurement instruments are central. Our study shows that many different and partly unvalidated instruments are used. In addition, there is a lack of information on the minimal clinically important change of the respective measurement instruments. Researchers should systematically conduct high-quality validation studies for the measurement instruments in DLSS patients. In addition, the patients' perspective should be included in the selection of measurement instruments. Further validation studies of measurement instruments specific to DLSS patients, with at least 50 patients and considering the quality criteria of Terwee et al. 61 , will help to quantify the symptoms relevant to DLSS patients and thus have a direct impact on the validity of future RCTs and observational studies.
Implications for clinical practice. Increasingly, patient-centered measurement instruments are recommended or required for measuring treatment outcome. Our study shows that the selection of adequately validated measurement instruments for DLSS patients is important and that many measurement instruments are not validated in this patient population. In particular, reliable and valid questionnaires specific to DLSS are helpful for everyday clinical practice, as clinical progress can be monitored and responses are less influenced by the treating individuals. For monitoring treatment response in DLSS, we believe that the ZCQ provides the most differentiated results. This questionnaire has the advantage of assessing pain, satisfaction, and disability at the same time.

Conclusion
Reporting of the validity of outcome measures was poor, and the validation of only one outcome measure was adequate. To compare results from clinical studies, outcome measures need to be validated in a disease-specific population, and external validation studies should be cited adequately. For monitoring treatment response in DLSS, the use of the ZCQ is recommended.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. www.nature.com/scientificreports/